Add multi-algorithm deterministic cuDNN convolutions #34951
Conversation
// A helper function to decide whether to enable deterministic cuDNN
// functionality.
bool RequireCuDNNDeterminism();
It seems weird for TF and XLA to ask for flags from StreamExecutor, as StreamExecutor is just a library, and it shouldn't impose constraints on its users about how to use it.
Ideally, the source of truth for "users require determinism" should be a flag in TF, which then gets passed down to XLA. Both TF and XLA would pass the same boolean down to StreamExecutor. But this ideal solution creates an unnecessary amount of work.
How about this:
- Create a TF flag `TF_DETERMINISTIC_AUTOTUNING` in `tensorflow/core/kernels/gpu_utils.h`, which is only used by TF kernels.
- Create a similar XLA flag in `DebugOptions` in `tensorflow/compiler/xla/xla.proto`, which is used by XLA.
- Keep the current StreamExecutor flags as they are, privately in `cuda_dnn.cc`.

When the end user wants determinism, they have to compose all the necessary flags. This way, at least the layering is less coupled.
Hi Tim,

Adding and Modifying Switches

I strongly prefer not to add additional determinism switches or change the functionality of the existing switches. `TF_CUDNN_DETERMINISTIC` makes cuDNN-based ops function deterministically, and `TF_DETERMINISTIC_OPS` aspires to making all ops, including those based on cuDNN, operate deterministically. Since a model either operates deterministically or not (there being no useful gray area), `TF_DETERMINISTIC_OPS` supersedes, and ultimately makes fully redundant, `TF_CUDNN_DETERMINISTIC`.

At the moment `TF_DETERMINISTIC_OPS` is being listened to, like a global variable, in both Python code and C++ code. I'm currently working on a deterministic mode for another op, which will also be enabled by `TF_DETERMINISTIC_OPS`. In time, `TF_DETERMINISTIC_OPS` will be referenced in more and more places in the code base. I envision the functionality of `TF_DETERMINISTIC_OPS` eventually being replaced by a fully plumbed-in config variable.

I have been communicating the status and recipes for TensorFlow determinism here, and I don't want to make that information unnecessarily complicated for the user. Note that I intend to document TensorFlow determinism at a high level in the official TensorFlow documentation soon.
Ideal Solution

I'm not sure what you mean by "a flag in TF." Perhaps you're referring to adding something to `tf.config`? Or are you just referring to what is represented by `RequireCuDNNDeterminism()` (the logical OR of `TF_CUDNN_DETERMINISTIC` with `TF_DETERMINISTIC_OPS`)?
Following your outline above, I'm thinking that a practical way forward might be to move `RequireCuDNNDeterminism()` from `tensorflow/stream_executor/cuda/cuda_helpers.h/.cc` to `tensorflow/core/kernels/gpu_utils.h/.cc` and use it where needed in `tensorflow/core/kernels/*`, including passing it down in the interface to the `GetConvolve*Algorithms()` methods in the stream executor.

I currently don't know how to pass this flag down to XLA, but I could work on figuring that out. Then XLA could also use it and pass it to the `GetConvolve*Algorithms()` methods in the stream executor.
Compromise

Instead of passing the flag down to XLA, would it be acceptable to you if I replicated the `RequireCuDNNDeterminism()` code (which looks at both `TF_CUDNN_DETERMINISTIC` and `TF_DETERMINISTIC_OPS`) in `tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc`? There, the setting can be both used and passed down to the `GetConvolve*Algorithms()` methods in the stream executor.
I just pushed a commit that removes the dependency of both `core/kernels:gpu_utils` and `xla/service/gpu:gpu_conv_algorithm_picker` on `stream_executor`. This has been done simply by replicating the code that listens for, and caches, `TF_CUDNN_DETERMINISTIC || TF_DETERMINISTIC_OPS` in both `core/kernels/gpu_utils.cc` and `xla/service/gpu/gpu_conv_algorithm_picker.cc`.

I considered passing a `require_determinism` flag through the parameter lists of the `CudnnSupport::GetConvolve*Algorithms()` methods (defined in `stream_executor/cuda/cuda_dnn.cc`), rather than generating the flag locally within `cuda_dnn.cc`, but realized that the flag was also needed for selecting the deterministic operation of cuDNN max pooling, which is also done in `cuda_dnn.cc`. There was no easy way to propagate the flag in that case.

So currently, the flag is generated independently, and equivalently, in three different places. To prevent code repetition, I would prefer for it to be in a library that can be included by StreamExecutor, core, and XLA. I don't know what the right library for that would be, however. I originally, incorrectly, chose `stream_executor/cuda:cuda_helpers` because of my lack of understanding of the architecture of TensorFlow.
To re-summarize my intentions:

1. I want to keep the API for determinism as simple as possible for users, which means not changing the meaning or functionality of `TF_CUDNN_DETERMINISTIC` and `TF_DETERMINISTIC_OPS`, and also extending the promise that they represent. This pull request represents a bug fix that requires those environment variables to directly control code in XLA and `core/kernels`.

2. For this development work, I think it makes sense to make the changes wherever necessary to attain the deterministic functionality as quickly as possible, while maintaining a simple and consistent API. Going forward, this is mostly going to appear as different parts of the codebase listening directly to `TF_DETERMINISTIC_OPS`. Ultimately, either I, or someone else, can go through and properly plumb the switch into all the places that have been highlighted by the use of `TF_DETERMINISTIC_OPS`, possibly from something like `tf.config.deterministic_ops`.
Assuming that it would be preferable to define `RequireCudnnDeterminism()` once, in one place, I've been looking for an appropriate place to do that. I'm wondering if `tensorflow/core/common_runtime/gpu` would make sense. I'm thinking that it could go into a new "module" (.h + .cc) called `gpu_determinism`. Would it be okay for StreamExecutor, core, and XLA to import and use this?
I think I've found another (even more?) appropriate place to define `RequireCudnnDeterminism()`: `tensorflow/core/util:use_cudnn` (use_cudnn.h / use_cudnn.cc).
@duncanriach, thanks for looking at these solutions!
I don't have a sense of the priority of this PR, so I'm happy to defer the call to you. Depending on the priority, I'm fine either with holding this PR (as you seem to suggest) or with querying the env var in multiple places, with comments describing the migration path (your original commits).
If you're willing to take it, and it seems that you are, then I would much prefer to have a bug fix in place immediately, by querying the env var in multiple places (as specified by the current commits).

Please will you clarify what you mean by "comments describing the migration path"? Do you mean adding comments explaining the intention to migrate to `tf.config.experimental.deterministic_ops` and associated plumbing? If so, I would gladly add that.

Also, are you happy with having the code defined by `RequireCudnnDeterminism()` (and the migration-plan comment) replicated in three different places in the codebase, or would you prefer, as I would, for it to be defined in one place, such as `tensorflow/core/common_runtime/gpu:gpu_determinism`? Refactoring that would be an easy and quick change to make.
> If you're willing to take it, and it seems that you are, then I would much prefer to have a bug fix in place immediately, by querying the env var in multiple places (as specified by the current commits).

Yes, this is what I meant.

> Please will you clarify what you mean by "comments describing the migration path"? Do you mean adding comments explaining the intention to migrate to `tf.config.experimental.deterministic_ops` and associated plumbing? If so, I would gladly add that.

Yes, plus verbal warning bits like "this is a temporary solution".

> Also, are you happy with having the code defined by `RequireCudnnDeterminism()` replicated in three different places in the codebase, or would you prefer, as I would, for it to be defined in one place, such as `tensorflow/core/common_runtime/gpu:gpu_determinism`? Refactoring that would be an easy and quick change to make.

I prefer to duplicate it in several places, each with the comment we talked about above, plus "this code is duplicated, and should be kept in sync with [other files]". I don't expect frequent changes to these duplicates, so keeping them in sync shouldn't be too much work.
Got it. Yes, it's very unlikely that the code will ever need to change (until it's removed). I'll work on an incremental commit to address these items. Thank you, Tim.
@timshen91, The most recent commit adds a detailed comment to all replicas of the function.
@duncanriach Can you please check the reviewer comments and keep us posted? Thanks!
@gbaned. I have responded to the reviewer. Please remove
Changes pushed, and discussed here.
@gbaned, please will you add the
History:
…onfig plus plumbing
…tic-cudnn-convolutions PiperOrigin-RevId: 291684013 Change-Id: I818177de66eeec3dd52e276a5894a1d7a7166459
This change had to be rolled back. It seems one of our test targets became flaky with this CL.
@akuegel, oh no! Could you tell me which test target became flaky? (adding link to roll-back commit here)
It is an internal target, but AFAICT the target was always flaky. I'm now following up with the team internally.
Awesome. Thank you, @sanjoy!
I will try to roll forward again.
Thank you, @akuegel.
…deterministic-cudnn-convolutions PiperOrigin-RevId: 292501090 Change-Id: I31fd5aa4ed36c2929f1300250352781fca749f37
Hi @duncanriach, I think your pull request introduced non-deterministic behavior in TensorFlow 2.5.0. For example, I launched TensorFlow twice to do inference for my convolution graph, using the following environment variables:

On the first run, cuDNN autotuning gave the following convolution algorithms: Do you happen to have a workaround to avoid this non-deterministic behavior? Is turning off autotuning via `TF_CUDNN_USE_AUTOTUNE`, and relying on `cudnnGetConvolutionForwardAlgorithm_v7` to select the convolution algorithm, a feasible solution? Thanks in advance.
Hi @Leiwu-Zheng, I believe that the code to deterministically select cuDNN convolution algorithms has evolved significantly since this pull request. @kaixih, please could you comment on this?
Is this for the TF2 code? For the cuDNN frontend API, we will use the first working deterministic engine when
@Leiwu-Zheng, @kaixih and I are working on this together. We'll get back to you, but it will probably be in the new year (2022). CCing @reedwm as well. Reed, my understanding is that @Leiwu-Zheng is setting

Wait, what version of TensorFlow are you using, @Leiwu-Zheng? According to my notes, in version 2.0, you'll need to use

Actually, scratch all of that. @Leiwu-Zheng, please will you open an issue with a clear and well-contained reproducer of the problem you're witnessing.
@duncanriach, thanks for the investigation. I have updated my previous comment to make it clearer. I will open an issue. Currently, I am looking for a workaround so that version 2.5.0 can generate deterministic results for both training and inference, but TF_USE_DEFAULT_CONV_ALGO=1 only works for inference. Update:
@Leiwu-Zheng, thank you for looking into this more deeply and for discovering this work-around. cuDNN convolution algorithm selection should be deterministic when deterministic ops are enabled (via

I think this is a bug. Please will you confirm that this issue exists in the latest release (version 2.7, if possible) and open an issue (after attempting to confirm that one does not already exist for this). Provide a simple-as-possible, well-contained reproducer that demonstrates this issue on a specific version of TensorFlow. Please tag me, @reedwm, and @kaixih in that new issue, and reference this discussion. Thank you for doing all this work, Leiwu.
This current pull request intends to address a bug in the functionality of the environment variable `TF_CUDNN_DETERMINISTIC` (see PR 24747) and also the environment variable `TF_DETERMINISTIC_OPS` (see PR 31465).

The current implementation (before application of this current pull request) of deterministic cuDNN convolution in TensorFlow chooses, for any layer configuration, one fixed deterministic algorithm for each of the forward and two backward propagation paths.

I have since come to appreciate that each algorithm is not guaranteed to work on all layer configurations. The solution represented by this current PR addresses that problem. It uses the existing auto-tuning mechanism to attempt to use all of the available deterministic algorithms and then, instead of choosing the fastest one (as regular auto-tuning does), chooses the first deterministic algorithm that works. It does this in a deterministic way, based on the algorithm's index in an intentionally ordered array.